Abstract



In this paper, we describe a set of metrics for
the evaluation of different dialogue management
strategies in an implemented real-time spoken language
system. The set of metrics we propose tries
to offer useful insights in evaluating how particular
choices in the dialogue management can affect the
overall quality of the man-machine dialogue. The
evaluation makes use of established metrics: the
transaction success, the contextual appropriateness
of system answers, the calculation of normal
and correction turns in a dialogue. We also define
a new metric, the implicit recovery, which makes it
possible to measure the ability of a dialogue manager to
deal with errors by different levels of analysis. We
report evaluation data from several experiments,
and we compare two different approaches to dialogue
repair strategies using the set of metrics we
argue for.

Introduction

A dialogue module which is part of a complex natural
language system (for example, of a speech understanding
system providing information) may be evaluated according
to different viewpoints, the most important of
which are:

(1) its ability to drive the user to find the required
information;

(2) the overall quality of the dialogic interaction;

(3) its capacity to maintain an acceptable level of
interaction with the user, even when other modules have
partial or total breakdowns.

The first point may be measured in terms of the success
of the dialogic transaction, but the second and the
third points are rather matters of subjective evaluation.
This dichotomy is reflected in the state of the art
of dialogue evaluation methods. While there is a set of
objective metrics which can be used to measure the
performance of a dialogue system (Hirschman et al. 1990),
only in the last few years has an effort been made to
define metrics which express the subjective evaluation
of dialogue systems.

Copyright © 2008, American Association for Artificial
Intelligence (www.aaai.org). All rights reserved.

By taking into account the shortcomings of the recent
work done on subjective evaluation (Simpson &
Fraser 1993), (Hirschman & Pao 1993), this paper has
the goal of arguing for a set of metrics which can be
fruitfully used to compare the behaviour of different di- 
alogue strategies within speech systems. In particular, 
we introduce a new metric (implicit recovery) which 
captures the ability of the dialogue manager to recover 
from partial or total failure of previous levels of analysis. 
Other metrics we used (i.e. contextual appropriateness,
turn correction ratio, transaction success) derive from
the set of evaluation metrics defined within the Sundial
Esprit project (Danieli et al. 1992): in what follows
they are discussed only to point out possible differences
in their interpretation.

We will show the application of these metrics to evaluate
data coming from two trials carried out on the spoken
man-machine dialogue system developed at CSELT
for the Italian language. The application domain of
the system is the Italian railway time-table; the system
allows users to access a remote database by telephone
and to get information about train times and
services. During the experimentation two different
approaches to dialogue management were tested and
evaluated. This methodology of evaluation makes it possible
to compare the different effects of the two approaches on the
system behaviour.

Metrics and Methods of Evaluation 
Implicit Recovery 

In evaluating a dialogue strategy for a spoken language
system, attention should be paid to capturing its capacity
to deal with situations in which errors occur. In particular,
we expect that a well conceived dialogue system
should be able to recover from both partial and total
failure by the previous levels of analysis. That feature
may be considered according to different points of view:
on one hand the dialogue manager should be able to filter
the parser output and to interpret it starting from
contextual knowledge. On the other hand, it should
embody explicit strategies to recover from understanding
or recognition errors. While the latter ability may
be evaluated in terms of the number of correction turns
undertaken by system and user (see below), we define
the implicit recovery (IR) as a measure of the former
ability.

The IR is the measure of the dialogue module's capacity
to regain utterances which have partially failed at
the recognition or understanding levels. When the linguistic
processor performs a robust partial parsing, the dialogue
module may receive either a correct representation
of the utterance's conceptual content, or only partial
results. Of course, in spite of the robustness of the parser,
complete failures in understanding may also occur.
Utterances which have been partially misunderstood may
have insertions of concepts which are not present in the
original utterance, deletions of some concepts or
substitutions of the value of a concept with another one.
The dialogue module should be able to deal with these
kinds of errors by interpreting them within the dialogue
context.

In order to measure the IR, we need a semantic
representation formalism that makes it possible to calculate
the percentage of correctly understood concepts. In evaluating
the experimental data (see the fourth section),
we used the conceptual accuracy metric (ConA) at the
syntactico-semantic level.^1
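
As a rough illustration of what a concept-level accuracy score can look like, the sketch below applies the usual word accuracy form, (N - S - D - I) / N, to concepts instead of words. The dictionary representation of concepts and the function names are assumptions made for this example; they are not the semantic formalism used in the system.

```python
def concept_errors(reference, hypothesis):
    """Count substituted, deleted and inserted concepts.

    Both arguments are dicts mapping a concept name to its value
    (an assumed representation, not the system's own formalism).
    """
    substitutions = sum(
        1 for name, value in reference.items()
        if name in hypothesis and hypothesis[name] != value
    )
    deletions = sum(1 for name in reference if name not in hypothesis)
    insertions = sum(1 for name in hypothesis if name not in reference)
    return substitutions, deletions, insertions


def conceptual_accuracy(reference, hypothesis):
    """Return (N - S - D - I) / N as a percentage, with N reference concepts."""
    n = len(reference)
    if n == 0:
        raise ValueError("reference must contain at least one concept")
    s, d, i = concept_errors(reference, hypothesis)
    return 100.0 * (n - s - d - i) / n


# Turn U1 of Figure 1: the departure city is lost by the lower levels.
reference_u1 = {
    "departure-city": "ROMA",
    "arrival-city": "MILANO",
    "departure-time": "MORNING",
}
parsed_u1 = {"arrival-city": "MILANO", "departure-time": "MORNING"}
cona_u1 = conceptual_accuracy(reference_u1, parsed_u1)  # one deletion out of three
```

With this scoring, an utterance is partially understood when the score is below 100% but some reference concepts survive, which is the case the IR metric is concerned with.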

To apply the IR metric, an expert examines the dialogue
logfiles and, for each user utterance, checks
whether the semantic representation of the utterance meaning
given by the parser is correct or not. No IR occurs
if the utterance has been correctly understood or has
completely failed. Otherwise the expert checks whether the error has
been recovered, i.e. an appropriate answer is given by
the system.^2 This case is marked as an IR. The
final IR result is the ratio, expressed as a percentage, of the number
of cases where the dialogue manager was able to correct
the conceptual errors to the number of sentences
which present conceptual errors.
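
The counting procedure above can be sketched in a few lines. The record type, its field names and the three understanding labels are hypothetical conventions chosen for this example; the expert's judgements are assumed to be already available.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedUtterance:
    # Expert judgement on the parser output: "correct", "partial" or "failed".
    understanding: str
    # Expert judgement on the system answer that followed the utterance.
    system_answer_appropriate: bool


def implicit_recovery(utterances):
    """Percentage of partially understood utterances that were recovered.

    Correctly understood and completely failed utterances are ignored,
    as in the procedure described in the text. Returns None when no
    utterance presents conceptual errors.
    """
    with_errors = [u for u in utterances if u.understanding == "partial"]
    if not with_errors:
        return None
    recovered = sum(1 for u in with_errors if u.system_answer_appropriate)
    return 100.0 * recovered / len(with_errors)


# The two user turns of Figure 1: both are partially understood and both
# are followed by an appropriate system answer.
figure_1 = [
    AnnotatedUtterance("partial", True),   # U1: departure city deleted
    AnnotatedUtterance("partial", True),   # U2: cost-of-ticket inserted
]
ir_figure_1 = implicit_recovery(figure_1)
```

For the dialogue of Figure 1 this yields an IR of 100%.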



U1: I want to go from Roma to Milano in the morning.
<arrival-city=MILANO, departure-time=MORNING>
S1: Sorry, where do you want to leave from?
U2: From Roma.
<departure-city=ROMA, cost-of-ticket?>
S2: Do you want to go from Roma to Milano leaving
in the morning?

Figure 1: Example of Implicit Recovery

Figure 1 shows an example of dialogue interaction
where two IR occur. In the first dialogue turn, the
user's utterance contains all the concepts the system
needs to retrieve the desired information, but the
recognition (or parsing) level fails to represent the
departure city. The dialogue module takes into account the correctly
understood concepts and asks for the concept which
was lost. In the second turn, the recognition level
inserted some words in the best decoded sequence that
the parser interprets as a request for the cost of the ticket.
But since for the dialogue strategy that concept is not
relevant in the current context, the system does not
consider it and asks the user to confirm only the
correct concepts it has been able to collect. In similar cases
we would say that the IR percentage is 100%.

^1 This is a metric which uses at the understanding level
the word accuracy formula, expressed in terms of insertion,
deletion and substitution of concepts, as in (Baggia et al.
1994).

^2 For the definition of contextual appropriateness see the
next section.

Other Metrics

The contextual appropriateness is a measure of the
degree of contextual coherence of the system answers. The
concept of contextual appropriateness (CA) is taken
from Grice's conversational maxims (Grice 1967)
and it has been used within the Sundial project to evaluate
the appropriateness of system utterances in their dialogue
context. We have restricted the definition of contextual
appropriateness proposed in (Simpson & Fraser
1993) to obtain a three-valued measure: appropriate,
inappropriate and ambiguous.

According to this restricted interpretation, we say
that a system utterance is appropriate (AP) when it
provides the user with the information he required,
when it asks him to give additional constraints which
are essential to interpret his request or when it
introduces (or continues) a repair strategy. A system
utterance is inappropriate (IA) when it supplies the user
with wrong information or when it fails to interpret the
speaker's utterance in the correct context. Finally, a
system utterance is ambiguous (AM) when it violates
the Gricean maxims of quantity and manner, i.e. when it is
over (or under) informative, obscure, or not
orderly and brief.
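
Once each system utterance has been labelled by hand with one of the three values, the CA figures are simple proportions. The following sketch assumes the labels are given as the strings "AP", "IA" and "AM"; the function name and the example labels are illustrative only.

```python
from collections import Counter

CA_LABELS = ("AP", "IA", "AM")


def contextual_appropriateness(labels):
    """Return the percentage of AP, IA and AM system utterances.

    `labels` is a sequence with one hand-assigned label per system
    utterance (an assumed input format).
    """
    counts = Counter(labels)
    unknown = set(counts) - set(CA_LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one labelled utterance is required")
    return {label: 100.0 * counts[label] / total for label in CA_LABELS}


# Hypothetical annotation of eight system utterances.
example_ca = contextual_appropriateness(
    ["AP", "AP", "IA", "AP", "AM", "AP", "IA", "AP"]
)
```

The three percentages always sum to 100, which is how the rows of a CA results table can be checked.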

During the implementation of the dialogue systems
we saw that it was useful to measure the ratio of those
turns which are concerned with anomalous behaviour
from both the user and the system to all turns in a
dialogue; we named this measure turn correction ratio
(TCR). The TCR is calculated by adding the results of
the application of two submetrics: the turn correction
by the system (STC) and the turn correction by the
user (UTC). The STC concerns those dialogue turns
where the system introduces a recovery strategy and
tells the user to repeat or rephrase his sentence. The
UTC occurs when the user detects or corrects an error,
or repeats or rephrases an utterance.

All the turns which are neither STC nor UTC are
considered normal turns: by following the classification
proposed in (Hirschman & Pao 1993), we consider normal
turns of the system the appropriate directives, such
as the introductory message, the appropriate diagnostic
messages and the correct answers. The normal turns
of the user are the utterances used to request information
(both first and continuation utterances), and the
answers to appropriate system directives.
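
The turn classification above leads directly to the TCR computation: correction turns by the system and by the user are counted against all turns of the dialogue and the two ratios are added. The label strings and the function name below are assumptions made for this sketch, and the example dialogue is invented.

```python
def turn_correction_ratio(turn_labels):
    """Return STC, UTC and TCR as percentages of all turns in a dialogue.

    `turn_labels` holds one hand-assigned label per turn: "STC" for a
    system correction turn, "UTC" for a user correction turn, and
    "NORMAL" for every other turn (assumed label names).
    """
    total = len(turn_labels)
    if total == 0:
        raise ValueError("a dialogue must contain at least one turn")
    stc = 100.0 * turn_labels.count("STC") / total
    utc = 100.0 * turn_labels.count("UTC") / total
    return {"STC": stc, "UTC": utc, "TCR": stc + utc}


# Hypothetical dialogue of 20 turns with one correction turn on each side.
example_turns = ["NORMAL"] * 18 + ["STC", "UTC"]
example_tcr = turn_correction_ratio(example_turns)
```

In this invented dialogue each submetric is 5% and the TCR is 10%.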

Finally, we used the concept of transaction success
(TS) to measure the success of the system in providing
the speakers with the information they required, when
such information is available in its database.

Methodology 

The system configuration made it possible to store all the data
collected in the tests: the speech material, the semantic
representation of the sentences (parser output), the dialogue
logfile (user/system interactions) and some timing data
(recognition time, parser time and dialogue time). All
the speech material had to be manually transcribed;
the dialogue corpus evaluation was performed by two
experts on the dialogue logfiles as in (Goodine et al.
199|).

Subjects' global impressions were collected by asking
subjects to complete a questionnaire. Comments on
the questionnaire are in (Ciaramella 1993).



Description of the Experimental Set-up



Two different trials were carried out over three months
on an integrated spoken man-machine dialogue system
which allows access to a remote DB by telephone.
This prototype was partially developed under
the Sundial Esprit Project. The application domain
consists of the Italian train time-table information. The
first trial was carried out in March 1993 and the second
one in May 1993. For the first trial, twenty subjects
were recruited among people who had never used a
computerized telephone service before. Those subjects
were paid for testing the system; ten of them were
female and ten were male; the average age of the
subjects was 37.

For the second trial, fifteen people were recruited
among CSELT staff: eleven of them were male and
four were female. The average age of the male subjects
was 35, that of the female subjects was 30.

The subjects came to the CSELT laboratories and
received a single page of printed directions which
contained a brief explanation of the service capabilities
and some instructions (e.g. "Please, speak after the
tone"). All the subjects carried out the test
alone in an isolated room. During the dialogue with
the system they had to get information about train
time-tables and related services (sleeping-cars,
restaurant, rates and extra-fares, reservation and so on).

To precisely determine if the task has been solved, 
predefined pictorial scenarios were used. Each scenario 
specified the departure and arrival city names, chosen 
among the set of 100 cities of the railway DB in use, and 
the train attributes to be collected during the dialogue, 
while the user was free to specify the departure time. 

In both trials each subject had to play at least 4
scenarios; the corpus of dialogues collected in the tests is
shown in Table 1. For each trial, the total number of
dialogues, the number of continuous speech utterances,
and the average number of words per utterance are
reported.



Trial   No. of Subj.   No. of Dial.   No. of Utt.   Avg. words per Utt.
1st     20             85             678           4.8
2nd     15             63             464           4.2

Table 1: Dialogues corpus characteristics



Overview of the System Architecture 

The system is composed of the following modules:
the acoustical front-end (AFE), the linguistic processor
(LP), the dialogue manager and message generator
(DM), and the text-to-speech synthesizer. The acoustical
front-end and the synthesizer are interconnected to
the PBX through a telephone interface, while the dialogue
manager is connected to a Computer Information
System to obtain the information on Italian train
time-tables. The system is nearly real time. For a complete
description of the system see (Clementino & Fissore
1993).



The AFE performs feature extraction and acoustic-phonetic
decoding; both DDHMM and CDHMM are
supported and the vocabulary size is about 800 words
(Fissore, Laface, and Micca 1991). In these experiments
we used a preliminary version of the recognition module
which always used DDHMM with the Forward decoding
algorithm and made no use of linguistic models.

The LP starts from the AFE output, the best-decoded
sequence, and performs a multi-step robust partial parsing.
In this strategy, partial solutions are accepted according
to the linguistic knowledge (Baggia & Rullent
1993). At the end of the parsing stage a deep semantic
representation for the user utterance is sent to the DM.

The DM models the user-system interaction and
contextually interprets sentences using a prediction mechanism
(Gerbino & Danieli 199^). The DM also implements
many recovery strategies in order to allow completion
of the task, even in case of lower level errors. Finally,
it generates the answer sent to the text-to-speech
synthesizer, which contains specific prosodic rules oriented
to the dialogue (Quazza et al. 1993).

During the trials, two different versions of the DM
(D1 and D2) were tested by all subjects; each carried
out two telephone calls with one version and two calls
with the other one.

The D1 version was totally developed at CSELT and
it is currently used in the integrated system. The D2
version was developed under the Sundial Esprit Project.

Figure 2 shows an example of dialogue interaction
carried out with the D1 version, while Figure 3 shows
an example of dialogue interaction carried out with the
D2 version. As we can observe from the dialogue examples,
the most salient differences between the two DM
versions concern the kind of information supplied and
the repair strategies.

The D1 version provides the opportunity to ask for
information about train times and services, but it does
not leave the user free to make requests about more
than one train route in a single telephone call. On the
contrary, the D2 version allows the request of information
about many train routes in a single telephone
call, but it does not provide the caller with information
about the train services. Moreover the D1 confirmation
and repair strategies are specifically implemented
to deal with possible speech errors. In fact the D1
(see Fig. 2) implements an explicit concept confirmation
strategy forcing the user to answer with isolated
words (yes or no); this strategy is robust and safe even
if it increases the number of turns spent in confirmations
and consequently the dialogue time. Besides, the
D1 strategy makes use of more robust speech interaction
modalities such as isolated word and spelling.

S1: Hello, This is Train Enquiry Service. Please speak
after the tone. Which information do you need?
U1: I want to go from Torino to Milano.
<departure-city=TRENTO, arrival-city=MILANO>
S2: Do you want to go from Trento to Milano? Yes
or No?
U2: No.
<confirmation=NO>
S3: Do you want to leave from Trento? Yes or No?
U3: No.
<confirmation=NO>
S4: Where do you want to leave from?
U4: I want to leave from Torino.
<departure-city=TORINO>
S5: Do you want to leave from Torino? Yes or No?
U5: Yes.
<confirmation=YES>
S6: Do you want to go to Milano? Yes or No?
U6: Yes.
<confirmation=YES>
S7: At which time do you want to leave?
U7: I want to travel in the evening.
<departure-time=EVENING>
S8: Do you want to leave between 6 and 9 p.m.?
Yes or No?
U8: Yes.
<confirmation=YES>
S9: There is a train from Torino Porta Nuova to Milano
Centrale. It leaves at 7.10 p.m. and arrives at 8.55
p.m.. Do you want more detailed information about
it?
U9: Is that an intercity train and how much does the
ticket cost?

Figure 2: Example of D1 dialogue interaction

The D2 system implements discourse strategies which
are more confident in the capabilities of the acoustic
and linguistic processors. In fact, D2 is able to support
three confirmation strategies: confirmation alone
for a group of concepts; confirmation for a group of
concepts plus initiative; confirmation concept by concept,
and then initiative. However the dialogue manager is
not able to switch autonomously from one strategy to
another when it detects troubles with the communication.
When an error occurs, D2 decides to enter a special
mode: after three requests for repetition, the last
one using the spelling modality, the system advises the
user to contact a human operator.

S1: Hello, This is Train Enquiry Service. Please speak
after the tone. Which information do you need?
U1: I want to travel from Milano to Roma.
<departure-city=MERANO, arrival-city=ROMA>
S2: At which time do you want to leave from Merano
to Roma?
U2: No, I want to leave from Milano in the evening.
<confirmation=NO, departure-city=MILANO,
departure-time=EVENING>
S3: Do you want to leave from Milano between 6 and
11 p.m.? Please answer Yes or No?
U3: Yes.
<confirmation=YES>
S4: Intercity 243 leaves from Milano Centrale at 20
past 8 p.m.; it arrives at Roma Termini 6 o'clock a.m..
Do you want another train information?
U4: No
<confirmation=NO>
S5: Good bye.

Figure 3: Example of D2 dialogue interaction
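
The special repair mode of D2 can be read as a small escalation rule. The sketch below is one possible reading of the description, not the Sundial implementation: the action names, the constant and the function are hypothetical, and it assumes that the first two requests ask for a plain repetition while the third uses the spelling modality.

```python
# Number of repetition requests D2 issues before giving up (from the text).
MAX_REPETITION_REQUESTS = 3


def next_repair_action(requests_already_made):
    """Choose the next repair move given how many requests were already made.

    Assumed reading of the strategy: plain repetition first, spelling on
    the last allowed request, then referral to a human operator.
    """
    if requests_already_made < 0:
        raise ValueError("the number of requests cannot be negative")
    if requests_already_made >= MAX_REPETITION_REQUESTS:
        return "refer_to_operator"
    if requests_already_made == MAX_REPETITION_REQUESTS - 1:
        return "ask_to_spell"
    return "ask_to_repeat"


# Sequence of moves when every repair attempt keeps failing.
escalation = [next_repair_action(n) for n in range(4)]
```

Under these assumptions a persistently failing exchange goes through two repetition requests, one spelling request and then the referral to an operator, which is why D2 can close an interaction early when recognition is poor.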

An example of the D2 multiple confirmation plus initiative
strategy is shown in Figure 3. There the U1
utterance is misunderstood at the recognition level. The
dialogue module implicitly asks for confirmation of
departure and arrival cities by asking for the desired
departure time (see S2). In U2 the subject denies the
departure city proposed by the system, reconfirms that
he wants to leave from Milano and gives the system the
departure time. System utterance S3 shows that D2
considers the arrival city as implicitly confirmed and
carries on the interaction by focusing on the newly
acquired concepts. During the testing of this system we
chose to run it with a confirmation strategy which did
not force the subjects to have recourse to isolated word
recognition.

Evaluation Results 

The dialogue corpus collected in the trials was analysed
according to the whole set of evaluation metrics.
At the recognition and understanding levels, users'
utterances were evaluated by considering the standard
measurements: respectively, the Word Accuracy (WA)
calculated on the best decoded sequence against the
transcribed uttered sentence, and the Sentence
Understanding (SU). In the first trial the results were: 52.1%
of WA and 50.9% of SU. In the second trial the results
were: 60.2% of WA and 59.1% of SU.^3 As regards these
data, we did not distinguish between the two different
DM versions because they are related to the two common
system modules (AFE and LP).

^3 Recent recognition results are available in (Giachin
1995). Now the system obtains 82.6% of WA by using
linguistic models at the recognition level.
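
For readers unfamiliar with the WA measure, a minimal sketch follows. It assumes the usual definition, (N - S - D - I) / N over the N words of the transcription, and obtains the minimal number of substitutions, deletions and insertions as a word-level edit distance. The function names and the decoded word sequence in the example are assumptions; the paper reports only the parsed concepts for that turn.

```python
def edit_distance(reference, hypothesis):
    """Minimal number of word substitutions, deletions and insertions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # reference word deleted
                current[j - 1] + 1,      # hypothesis word inserted
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def word_accuracy(transcription, decoded):
    """Word accuracy (percent) of a decoded sequence against a transcription."""
    reference = transcription.lower().split()
    hypothesis = decoded.lower().split()
    if not reference:
        raise ValueError("the transcription must contain at least one word")
    errors = edit_distance(reference, hypothesis)
    return 100.0 * (len(reference) - errors) / len(reference)


# Turn U1 of Figure 2, with a hypothetical decoded sequence in which
# "Torino" was recognised as "Trento".
wa_u1 = word_accuracy(
    "I want to go from Torino to Milano",
    "I want to go from Trento to Milano",
)
```

Note that with this definition the score can become negative when the decoded sequence contains many inserted words.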

As regards the dialogue level, we calculated contextual
appropriateness (CA), explicit recovery (ER) and
implicit recovery (IR). Moreover we distinguished
between the two DM versions, in order to study the
capability of these metrics to point out the differences
between various dialogue strategies. Table 2 shows the
results obtained in the trials for the contextual
appropriateness. The first column reports the percentage of
the appropriate sentences uttered by the system; the
second column reports the percentage of the inappropriate
sentences and the third column shows the
percentage of ambiguous utterances. As we can see, both
the dialogue systems are seldom ambiguous; this tells
us that the generation modules of both systems are
good.



                    CA
Trial     AP       IA       AM
1st D1    77.6%    20.6%    1.8%
1st D2    49.2%    50.3%    0.5%
2nd D1    79.1%    19.3%    1.6%
2nd D2    56.5%    43.5%    0.0%

Table 2: Contextual Appropriateness Results

We deem that the contextual appropriateness metric
is useful to evaluate the quality of the dialogic interaction
and to address the issue of co-operation in human
computer dialogue. We can expect that in an ideally
perfect speech system, where no recognition and
understanding errors occur, the CA should measure properly
the DM capability to correctly interpret users'
utterances. Starting from the same percentage of correctly
understood utterances (respectively 50.9% and 59.1%
in the two trials), D1 and D2 get very different CA
scores. In particular, AP results reflect the greater
robustness of the D1 version when it faces difficulties
at the recognition or understanding level.

Another result which stands out is the growth of the
percentage of AP when the users are good conversationalists
with the computer, as the subjects participating
in the second trial were. The value of AP increases
more for the D2 dialogue system: that means that the
dialogue strategies it implements are more sensitive to
the performance of the previous levels of analysis.
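
The claim that AP grows more for D2 can be checked directly against the AP column of Table 2. The snippet below only restates those four values and takes their differences; the variable names are illustrative.

```python
# AP percentages from Table 2, per DM version and trial.
ap_by_version = {
    "D1": {"1st": 77.6, "2nd": 79.1},
    "D2": {"1st": 49.2, "2nd": 56.5},
}

# Gain in percentage points from the first to the second trial,
# rounded to one decimal to match the precision of the table.
ap_gain = {
    version: round(trials["2nd"] - trials["1st"], 1)
    for version, trials in ap_by_version.items()
}
```

The gain is 1.5 percentage points for D1 and 7.3 for D2, while SU rose from 50.9% to 59.1% between the two trials.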



                ER
Trial     UTC      STC      IR
1st D1    24.8%    31.8%    17.0%
1st D2    67.9%    65.6%    10.8%
2nd D1    25.6%    22.5%    17.0%
2nd D2    45.0%    49.2%    10.7%

Table 3: Recovery Results



Table 3 shows the results obtained in the trials for
the metrics which measure the recoveries from errors
implemented both by the systems and by the subjects.
If we consider the ER data, we can read in the first
column the percentage of correction turns done by the
users (UTC), while the percentage of correction turns
by the systems (STC) is reported in the second column.
The experiments highlight that if a dialogue strategy is
not robust enough to deal with errors by the lower levels
(see D2 data), the number of turns spent by both user
and system in repairing errors grows. When
users are more co-operative, as staff subjects were, the
percentage of STC and UTC decreases.

The third column of Table 3 reports the percentage
of turns in which the dialogue systems implicitly
recovered from errors by recognition and understanding.
Those data show that the capacity of implicit recovery
of D1 and D2 does not vary from naive to expert users:
in fact, IR is a measure of a dialogue system's ability
and it does not depend upon the degree of users'
co-operation. Since D1 and D2 make use of different
degrees of predictive contextual knowledge, we expected a
difference in their IR performance and that is shown by
the data. The different performance is also due to the
fact that D1 applies its predictive knowledge in more
and more focused interpretation contexts as the
dialogue goes on. For example, the recourse to the
request for a single concept (see Figure 2, turns S4 and
S7) allows the use of focused predictive knowledge.

On the contrary, the interpretative focus of D2 is always
wider, so that it cannot exploit the advantages of
very constrained contextual interpretation, which seems
to be useful in this kind of application of discourse
analysis. Let us consider the use of the implicit confirmation
strategy shown in Figure 3, turn S2. There the reply
to the system enquiry by the subject may contain a great
deal of information, for instance the negation of what
the system said along with the introduction of new
concepts (see turn U2). In this case, the use of very
constrained predictive knowledge is hardly possible.

Table 4 shows the whole system performance: the
percentage of TS, the average number of turns per
dialogue, the average dialogue time, and the TCR.



Trial     TS       Avg. No. of Turns   Avg. Dial. Time   TCR
1st D1    77.6%    20                  5'15"             10.0%
1st D2    51.0%    11                  3'20"             27.0%
2nd D1    96.6%    21                  5'09"             9.5%
2nd D2    83.3%    11                  2'59"             15.0%

Table 4: Whole System Performances

The TS is always good, but it increases when the users
are more friendly or when the acoustic and linguistic
processors perform better.

Finally, we notice that the number of turns and the
dialogue time are higher with D1: this difference is due
to the fact that D1 allows the request of much information
about train services and it does not close the
interaction if there are difficulties in recognition or
understanding.

Conclusions 

The results of these experiments are encouraging as
regards the effectiveness of the metrics we used. We have
argued that it is important to capture the ability of a
dialogue system to reduce the consequences of recognition
and understanding errors.

The necessity of many specific metrics is due to the
fact that various dialogue strategy aspects have to be
evaluated. In fact, at least three aspects have to be
measured: the dialogue system's ability to drive the user
to find the desired information is captured by measuring
the transaction success along with the average number
of turns in the dialogue, while the quality of the
man-machine interaction is measured by the metric of
contextual appropriateness. Finally, the dialogue system's
robustness is evaluated by measuring its ability to
perform both implicit and explicit recoveries when the
lower levels of the system fail. This set of metrics also
enables the dialogue system designer to verify the success
of alternative strategies. In our view, another important
research topic in this field should be the
definition of methods for evaluating the system's
effectiveness and friendliness from the user's point of view.

Before concluding this paper, we would like to thank
Sheyla Militello for her help during the experimentation
activity.